Binary / row helpers #6096
base: master
Conversation
Currently these are accessible via `AsRef`, but that trait only gives you the bytes with the lifetime of the `Row` struct and not the lifetime of the backing data.
arrow-row/src/lib.rs
Outdated
/// Create a [BinaryArray] from the [Rows] data without reallocating the
/// underlying bytes.
pub fn into_binary(self) -> BinaryArray {
    assert!(
I wonder if we should return `LargeBinaryArray` here, as the offsets are `Vec<usize>` (not `Vec<u32>`)
Open to it! Since array indices are signed ints, we'd need this assert in any case, so I went with what seemed like the most common type. I don't feel especially strongly about it though!
I agree `Binary` is likely the most common type. We could potentially add a `to_large_binary` to support converting to `LargeBinary` 🤔
Thanks @bkirwi and @XiangpengHao
I think this PR needs some additional negative tests and error testing, but otherwise it looks good to me
cc @tustvold in case you have time to comment on the safety of the design
    self.buffer.len() <= i32::MAX as usize,
    "rows buffer too large"
);
let offsets_scalar = ScalarBuffer::from_iter(self.offsets.into_iter().map(i32::usize_as));
I think this needs to check that the offsets don't overflow an `i32`; this should be a `try_into`, I think, and the method should be something like `fn try_into_binary(self) -> Result<BinaryArray>`
My belief is that this is guaranteed by the assert above (which asserts that the len is not larger than `i32::MAX`) and the existing offset invariant (which guarantees that all offsets are valid indices into the binary data), so a more expensive O(n) check seemed redundant.

I'll go ahead and turn that assert into a `Result::Err`; let me know what you think about the other side of it!
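The invariant argument can be sketched in plain Rust (the names here, like `try_offsets_to_i32`, are illustrative stand-ins, not the arrow-row API): a single O(1) length check is enough, because the `Rows` invariant keeps every offset bounded by `buffer.len()`.

```rust
// Hypothetical sketch, not the real arrow-row code: if the whole buffer fits
// in an i32, the offset invariant means every individual offset does too, so
// no per-element overflow check is needed.
fn try_offsets_to_i32(buffer_len: usize, offsets: &[usize]) -> Result<Vec<i32>, String> {
    // Single O(1) check in place of an O(n) scan over the offsets.
    if buffer_len > i32::MAX as usize {
        return Err("rows buffer too large".to_string());
    }
    // Under the invariant `offset <= buffer_len`, this cast cannot truncate.
    Ok(offsets.iter().map(|&o| o as i32).collect())
}

fn main() {
    assert_eq!(try_offsets_to_i32(10, &[0, 4, 10]).unwrap(), vec![0, 4, 10]);
    assert!(try_offsets_to_i32(i32::MAX as usize + 1, &[0]).is_err());
}
```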
I agree with @bkirwi's logic here. If we assume that `self.offsets` is well formed with respect to `self.buffer`, then we shouldn't need to check the individual offsets.
@@ -738,6 +738,23 @@ impl RowConverter {
    }
}

/// Create a new [Rows] instance from the given binary data.
pub fn from_binary(&self, array: BinaryArray) -> Rows {
I wonder if this function needs to be marked `unsafe`; I am worried that someone inserts invalid data into `Rows` here (e.g. modifies the bytes to read in invalid UTF8). However, I see that there is already a way to convert between `Rows` and `[u8]` and then from `[u8]` to `Rows` (e.g. `RowParser::parser`)
Yeah, that's my understanding! This would definitely be unsafe without the `validate_utf8` below... but with it, I believe this has the same safety properties as the existing public API.
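The validate-before-wrap pattern being discussed can be illustrated with a std-only sketch (`parse_validated` is a hypothetical stand-in, not the arrow-row API): validating the bytes at the boundary is what lets the safe wrapper stay sound without marking the constructor `unsafe`.

```rust
// Toy illustration of validating untrusted bytes up front; later code can then
// assume the data is well formed, which keeps the safe API sound.
fn parse_validated(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Rejecting invalid input here is the analogue of `validate_utf8` in the PR.
    std::str::from_utf8(bytes)
}

fn main() {
    assert!(parse_validated(b"hello").is_ok());
    assert!(parse_validated(&[0xff, 0xfe]).is_err()); // invalid UTF-8 is rejected
}
```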
for (actual, expected) in back.iter().zip(&arrays) {
    actual.to_data().validate_full().unwrap();
    dictionary_eq(actual, expected)
}
Could you also add some negative tests (e.g. create a BinaryArray from some random bytes, try to convert that back to an array, and make sure it panics)?
Sure! I notice there's just one existing test for the parser, for utf8 data; I've matched that and added a couple more tests for interesting cases.
(This seems like great API surface to fuzz... but it's challenging to write a real fuzzer for, since panics are expected and miri is disabled for our existing fuzzer. May be interesting future work!)
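A negative test of the shape requested might look like the following toy sketch (`parse_row` is a hypothetical stand-in for the real parser): feed malformed bytes in and assert, via `catch_unwind`, that the parser panics rather than silently accepting them.

```rust
use std::panic;

// Toy stand-in for a parser that panics on malformed input.
fn parse_row(bytes: &[u8]) -> u8 {
    assert!(!bytes.is_empty(), "row data must be non-empty");
    bytes[0]
}

fn main() {
    // Valid input parses normally...
    assert_eq!(parse_row(&[7]), 7);
    // ...while malformed input panics, which the negative test catches.
    let result = panic::catch_unwind(|| parse_row(&[]));
    assert!(result.is_err());
}
```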
/// Create a new [Rows] instance from the given binary data.
Can you please add a doc example showing how to do this? I think trying to give a basic example of converting rows, then to/from binary, will not only serve as good documentation, it will make sure all the required APIs are pub (for example, I think `RowParser` needs to be pub)
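The round trip being asked for can be sketched with a toy model (these are illustrative types, not the real `Rows`/`RowConverter` API): a flat byte buffer plus offsets goes out as raw parts via `into_binary` and is rebuilt by `from_binary`, with no per-row copying of the contents.

```rust
// Toy model of the rows <-> binary round trip; not the arrow-row types.
struct ToyRows {
    buffer: Vec<u8>,    // all row bytes, concatenated
    offsets: Vec<usize>, // row i spans offsets[i]..offsets[i + 1]
}

impl ToyRows {
    fn row(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }
    // Hand back the raw parts without reallocating the row bytes.
    fn into_binary(self) -> (Vec<u8>, Vec<usize>) {
        (self.buffer, self.offsets)
    }
    // Rebuild rows from previously exported parts.
    fn from_binary(buffer: Vec<u8>, offsets: Vec<usize>) -> ToyRows {
        ToyRows { buffer, offsets }
    }
}

fn main() {
    let rows = ToyRows { buffer: b"abcd".to_vec(), offsets: vec![0, 2, 4] };
    let first = rows.row(0).to_vec();
    let (buffer, offsets) = rows.into_binary();
    let back = ToyRows::from_binary(buffer, offsets);
    assert_eq!(back.row(0), &first[..]); // data survives the round trip
}
```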
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
Marking as draft so it is clear this PR isn't waiting on feedback anymore (at least I don't think it is). Please mark it as ready for review when it is ready for another look.
Thanks for the review! I think I've addressed all comments, though there were a couple things I wasn't certain of - addressed inline.
(Looks like there was some merge skew in the tests; I've merged the main branch in here, which ought to fix it.)
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more.
A few minor comments but these seem like straightforward changes to me.
buffer: array.values().to_vec(),
offsets: array.offsets().iter().map(|&i| i.as_usize()).collect(),
config: RowConfig {
    fields: Arc::clone(&self.fields),
More for my curiosity than anything, but why `Arc::clone(&self.fields)` instead of `self.fields.clone()`?
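For what it's worth, the two spellings are equivalent at runtime; `Arc::clone(&x)` just makes it explicit at the call site that only the reference count is bumped, not the underlying data. A minimal std-only illustration:

```rust
use std::sync::Arc;

// Both forms clone the Arc (pointer + refcount bump), never the Vec inside.
fn clones_share_allocation() -> (bool, usize) {
    let fields = Arc::new(vec!["a", "b"]);
    let by_method = fields.clone();     // method syntax resolves to Arc's Clone
    let by_assoc = Arc::clone(&fields); // identical effect, more explicit
    (Arc::ptr_eq(&by_method, &by_assoc), Arc::strong_count(&fields))
}

fn main() {
    let (same_allocation, count) = clones_share_allocation();
    assert!(same_allocation); // both clones point at the same data
    assert_eq!(count, 3);     // fields + two clones are all alive here
}
```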
///
/// // We can convert rows into binary format and back in batch.
/// let values: Vec<OwnedRow> = rows.iter().map(|r| r.owned()).collect();
/// let binary = rows.try_into_binary().expect("small");
I got a little confused by `.expect("small")`. What does "small" mean in this context? Why not just `.unwrap()`?
(ah, I see rustfmt is failing; CI probably needs to pass before merge)
@bkirwi can you please fix the CI tests so we can merge this PR? Thank you @westonpace for the review.
Thanks for the review! I should be able to get to the follow-up later this week.
/// let parsed: Vec<OwnedRow> =
///     binary.iter().flatten().map(|b| parser.parse(b).owned()).collect();
/// assert_eq!(values, parsed);
/// ```
Maybe it is worth documenting here when this will return an error (i.e. when the data is more than 2GB)?
@@ -738,6 +738,42 @@ impl RowConverter {
    }
}

/// Create a new [Rows] instance from the given binary data.
///
/// ```
I think it may also be worth adding a doc comment here about when this API will panic (when the data passed in is invalid or empty).
Which issue does this PR close?
Closes #6063.
(Potentially - still under discussion at the linked issue!)
Rationale for this change

I've added the optional `from_binary` method discussed in the associated issue also.

What changes are included in this PR?
`data`, `into_binary` and `from_binary` functions, and an extension to the fuzz test that checks the data survives the roundtrip.

Are there any user-facing changes?
Yes, though I suspect the rustdoc covers them enough?